Multi-agent learning via gradient ascent activity-based credit assignment

We consider the situation in which cooperating agents learn to achieve a common goal based solely on a global return that results from all agents' behavior. The proposed method takes into account the agents' activity, which can be any additional information that helps solve decentralized multi-agent learning problems. We propose a gradient ascent algorithm and assess its performance on synthetic data.


Simulation environment and parameters
In this section, we present simulation results for GAtACA-Parallel. All Monte Carlo simulations were performed over N = 30 realizations, using the Mersenne Twister pseudo-random number generator from the NumPy 1.11.1 library, with seed = 452361. Each realization, indexed by r ∈ [N], produces at each episode e ⩾ 1 a value Y^{(e),[r]} for any tracked variable Y. The mean and the standard deviation of Y over the realizations are therefore computed from the sample (Y^{(e),[r]})_{r∈[N]}:

Ȳ^{(e)} = (1/N) Σ_{r=1}^{N} Y^{(e),[r]},   σ_Y^{(e)} = √( (1/(N−1)) Σ_{r=1}^{N} (Y^{(e),[r]} − Ȳ^{(e)})² ).
The confidence interval (CI) at episode e is given as

CI^{(e)} = [ Ȳ^{(e)} − c σ_Y^{(e)}/√N , Ȳ^{(e)} + c σ_Y^{(e)}/√N ],

where c ≈ 2.045 is the 97.5th percentile of a Student's t-distribution with N − 1 = 29 degrees of freedom. The same seed is used for the two cases of X considered in this paper, so as to guarantee the same environment for the different learning objectives.
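To make the aggregation concrete, the following sketch computes the per-episode mean, standard deviation, and 95% CI from a stack of realizations. The array name Y, its shape, and the placeholder data are assumptions for illustration, not the paper's code.

```python
import numpy as np
from scipy.stats import t as student_t

# Hypothetical layout: Y[r, e] is the value of a tracked variable for
# realization r (out of N = 30) at episode e.
rng = np.random.RandomState(452361)       # NumPy's legacy Mersenne Twister generator
N, n_episodes = 30, 1000
Y = rng.standard_normal((N, n_episodes))  # placeholder data, for illustration only

mean = Y.mean(axis=0)               # sample mean over realizations, per episode
std = Y.std(axis=0, ddof=1)         # unbiased sample standard deviation
c = student_t.ppf(0.975, df=N - 1)  # 97.5th percentile of Student's t, ~2.045 for 29 dof
half_width = c * std / np.sqrt(N)   # the 95% CI at episode e is mean[e] +/- half_width[e]
```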
We provide a comprehensive list of variables used throughout the article, along with their descriptions.

Symbol      Description
π_θ         The action policy
E           The total activity
            The reward of action a in MAB i
ν_{i,a}     The activity of agent i while taking action a in MAB i
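To fix ideas, here is a minimal sketch of this notation, assuming (consistently with the convergence discussion below) that each agent i draws its arm in MAB i from a softmax over its parameters θ_{i,a}; the sizes, the activity table nu, and the sampled values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3                      # n agents, k arms per multi-armed bandit (MAB)
theta = rng.normal(size=(n, k))  # policy parameters theta[i, a]
nu = rng.uniform(size=(n, k))    # activity nu[i, a] of agent i playing arm a

def policy(theta):
    """Per-agent softmax: pi[i, a] is the probability that agent i plays arm a."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))  # shift for stability
    return z / z.sum(axis=1, keepdims=True)

pi = policy(theta)
actions = np.array([rng.choice(k, p=pi[i]) for i in range(n)])  # one joint action
E_total = nu[np.arange(n), actions].sum()  # total activity E for this joint action
```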

Proofs
Proof [Theorem 1] Recall the objective function in Eq. 1. The conditional expectation E[X^{(1)} ∣ A^{(1)} = a] is a function of the environment, and is independent of the agent's policy π_θ. Note that for all (i, j),

E_θ[ ∂_{θ_{i,j}} log π_{θ_i}(A_i^{(1)}) ] = Σ_a π_{θ_i}(a) ∂_{θ_{i,j}} log π_{θ_i}(a) = Σ_a ∂_{θ_{i,j}} π_{θ_i}(a) = ∂_{θ_{i,j}} 1 = 0.

Thus, for any random variable B^{(1)} that is independent of A_i^{(1)} under π_θ,

E_θ[ B^{(1)} ∂_{θ_{i,j}} log π_{θ_i}(A_i^{(1)}) ] = E_θ[B^{(1)}] E_θ[ ∂_{θ_{i,j}} log π_{θ_i}(A_i^{(1)}) ] = 0. □
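The key step is the score-function identity: the expected score of a softmax policy is zero, so any factor independent of A_i^{(1)} averages out. The snippet below checks this numerically for a single softmax policy; the logits, the distribution of B, and the sample size are arbitrary choices made for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 5
theta = rng.normal(size=k)                 # logits of one agent's softmax policy
pi = np.exp(theta) / np.exp(theta).sum()

M = 200_000
A = rng.choice(k, size=M, p=pi)            # actions sampled from pi_theta
B = rng.normal(size=M)                     # independent of A by construction

# Softmax score: d/dtheta_j log pi(a) = 1{a = j} - pi_j
score = (A[:, None] == np.arange(k)) - pi  # shape (M, k)

print(score.mean(axis=0))                  # ~0: E[grad log pi(A)] = 0
print((B[:, None] * score).mean(axis=0))   # ~0: E[B grad log pi(A)] = 0
```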

About the convergence of Theorem 2
We replace θ by θ̃ to solve convergence issues: the distributions π_{θ^{(e)}} and π_{θ̃^{(e)}} are the same when the elements of θ^{(e)} are finite, but the limit of (π_{θ^{(e)}})_{e⩾1} may not be well-defined when there exist i, a and a′ ≠ a such that θ^{(e)}_{i,a} and θ^{(e)}_{i,a′} tend to +∞ at the same time. In contrast, for any sequence θ̃^{(e)} → θ̃^{∞} ∈ Θ, we have π_{θ̃^{(e)}} → π_{θ̃^{∞}}.
An example where the gradient is zero is when #{a : θ_{i,a} = −∞} = k_i − 1 for some i: in this case, the i-th agent always chooses the same arm. Any finite change of θ does not change the distribution of the chosen decision, and hence does not change E_θ[X^{(1)}].
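This degenerate case is easy to reproduce with a softmax parameterization: once k_i − 1 logits sit at −∞, the policy is a point mass and remains one under any finite perturbation. A small sketch, with hypothetical values:

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta_i = np.array([-np.inf, -np.inf, 0.7])        # k_i - 1 = 2 coordinates at -infinity
print(softmax(theta_i))                            # [0. 0. 1.]: agent i always plays arm 3
print(softmax(theta_i + np.array([3., -2., 5.])))  # still [0. 0. 1.] after a finite change
```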
Note that this theorem requires the step size to tend to zero, but not too fast. This is a usual assumption in gradient descent (typically Σ_e α_e = ∞ and Σ_e α_e² < ∞), though how to choose (α_e)_{e⩾1} is a delicate issue in practice, with no universal answer. In our algorithm and in Equation 4, we chose to take α_e constant. Although this case is not covered by the above theorem, the simulations show that the algorithm does converge toward the optimal solution.
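For illustration, the sketch below contrasts a decaying, Robbins-Monro style schedule with a constant step size, on a generic one-dimensional stochastic gradient ascent; the objective and the noise level are placeholders, and this is not Equation 4 itself.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_grad(theta):
    """Noisy gradient of f(theta) = -(theta - 3)^2, which is maximal at theta = 3."""
    return -2.0 * (theta - 3.0) + rng.normal(scale=2.0)

for schedule in ("decaying", "constant"):
    theta = 0.0
    for e in range(1, 20_001):
        alpha_e = 1.0 / e if schedule == "decaying" else 0.01
        theta += alpha_e * noisy_grad(theta)
    # The decaying schedule satisfies the theorem's assumption; the constant one
    # does not, yet in practice both end up near the maximizer theta = 3.
    print(schedule, round(theta, 3))
```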
Since an intersection of nested compact connected sets is itself a compact connected set, and the families (cl(⋃_{e⩾N} B(θ̃^{(e)}, ε)))_{N⩾1} and (⋂_{N⩾1} cl(⋃_{e⩾N} B(θ̃^{(e)}, ε)))_{ε>0}, where cl denotes the closure, are nested compact connected sets, L is also a compact connected set, which concludes the proof. □

Additional figures

Fig. 1: Evolution of the multi-agent objective X = R/n (left) and X = R/E (right) for different values of α. The solid horizontal line corresponds to the maximum of X in each case.

Fig. 2: Evolution of the global reward R (left) and the total activity E (right) when the objective is X = R/E. The solid lines correspond to the maximum of R and to the minimum of the activity E, while the dashed lines correspond to the evaluation of the global return R and of the total activity for the most probable trajectory â under the average policy returned by the N = 30 GAtACA-Parallel realizations.

Fig. 3: Evolution of the parameter θ_i for an agent i ∈ I contributing to the global reward (agent i = 6) and an agent i ∈ [n] ∖ I that does not contribute (agent i = 30). Top: objective X = R/n. Bottom: objective X = R/E.